Bayesian Statistics


Lecture 05

February 5, 2024

Review

Last Class(es)

  • Probability models for dynamical systems/simulation models
  • Generative model: can include discrepancy and/or observational errors
    • Model data, not expected value (regression)
  • Maximize likelihood over model and statistical parameters.

Non-Uniqueness of MLE

  • Many models do not have well-defined maximum likelihoods.

Non-Identifiability

\[\underbrace{h_t}_{\substack{\text{hare} \\ \text{pelts}}} \sim \text{LogNormal}(\log(\underbrace{p_H}_{\substack{\text{trap} \\ \text{rate}}} H_T), \sigma_H)\] \[l_t \sim \text{LogNormal}(\log(p_L L_T), \sigma_L)\]

\[ \begin{align*} \frac{dH}{dt} &= H_t b_H - H_t (L_t m_H) \\ H_T &= H_1 + \int_1^T \frac{dH}{dt}dt \end{align*} \]

\[ \begin{align*} \frac{dL}{dt} &= L_t (H_t b_L) - L_t m_L \\ L_T &= L_1 + \int_1^T \frac{dL}{dt}dt \end{align*} \]

Non-Uniqueness of MLE

  • Many models do not have well-defined maximum likelihoods.
  • Can be due to multi-modality or “ridges”.
  • Sometimes also referred to as equifinality.
  • Poses problems for MLE.

Bayesian Statistics

Prior Information

So far: no way to use prior information about parameters (other than bounds on MLE optimization).

For example: what “trap rates” are more plausible?

Bayes’ Rule

Original version (Bayes, 1763):

\[P(A | B) = \frac{P(B | A) \times P(A)}{P(B)} \quad \text{if} \quad P(B) \neq 0.\]

Bayes’ Rule

“Modern” version (Laplace, 1774):

\[\underbrace{{p(\theta | y)}}_{\text{posterior}} = \frac{\overbrace{p(y | \theta)}^{\text{likelihood}}}{\underbrace{p(y)}_\text{normalization}} \overbrace{p(\theta)}^\text{prior}\]

Bayes’ Rule (Ignoring Normalizing Constants)

The version of Bayes’ rule which matters the most for 95% (approximate) of Bayesian statistics:

\[p(\theta | y) \propto p(y | \theta) \times p(\theta)\]

“The posterior is the prior times the likelihood…”

Credible Intervals

Bayesian credible intervals are straightforward to interpret: \(\theta\) is in \(I\) with probability \(\alpha\).

Choose \(I\) such that \[p(\theta \in I | \mathbf{y}) = \alpha.\]

Bayesian Model Components

A fully specified Bayesian model includes:

  1. Prior distributions over the parameters, \(p(\theta)\)
  2. Probability model for the data given the parameters (the likelihood), \(p(y | \theta)\)t

Think: Prior provides proposed explanations, likelihood re-weights based on ability to produce the data.

Generative Modeling

Bayesian models lend themselves towards generative simulation by generating new data \(\tilde{y}\) through the posterior predictive distribution:

\[p(\tilde{y} | \mathbf{y}) = \int_{\Theta} p(\tilde{y} | \theta) p(\theta | \mathbf{y}) d\theta\]

How To Choose A Prior?

One perspective: Priors should reflect “actual knowledge” independent of the analysis (Jaynes, 2003)

Another: Priors are part of the probability model, and can be specified/changed accordingly based on predictive skill (Gelman et al., 2017; Gelman & Shalizi, 2013)

What Makes A Good Prior?

  • Reflects level of understanding (informative vs. weakly informative vs. non-informative).
  • Does not zero out probability of plausible values.
  • Regularization (extreme values should be less probable)

What Makes A Bad Prior?

  • Assigns probability zero to plausible values;
  • Weights implausible values equally as more plausible ones;
  • Double counts information (e.g. fitting a prior to data which is also used in the likelihood)
  • Chosen based on vibes.
  • Personal opinion: Uniform distributions

A Coin Flipping Example

We would like to understand if a coin-flipping game is fair. We’ve observed the following sequence of flips:

flips = ["H", "H", "H", "T", "H", "H", "H", "H", "H"]
9-element Vector{String}:
 "H"
 "H"
 "H"
 "T"
 "H"
 "H"
 "H"
 "H"
 "H"

Coin Flipping Likelihood

The data-generating process here is straightforward: we can represent a coin flip with a heads-probability of \(\theta\) as a sample from a Bernoulli distribution,

\[y_i \sim \text{Bernoulli}(\theta).\]

flip_ll(θ) = sum(logpdf(Bernoulli(θ), flips .== "H"))
θ_mle = Optim.optimize-> -flip_ll(θ), 0, 1).minimizer
round(θ_mle, digits=2)
0.89

Coin Flipping Prior

Suppose that we spoke to a friend who knows something about coins, and she tells us that it is extremely difficult to make a passable weighted coin which comes up heads more than 75% of the time.

Coin Flipping Prior

Since \(\theta\) is bounded between 0 and 1, we’ll use a Beta distribution for our prior, specifically \(\text{Beta}(4,4)\).

Code
prior_dist = Beta(5, 5)
plot(prior_dist; label=false, xlabel=L"$θ$", ylabel=L"$p(θ)$", linewidth=3, tickfontsize=16, guidefontsize=18)
plot!(size=(500, 500))
Figure 1: Beta prior for coin flipping example

Maximum A Posteriori Estimate

Combining using Bayes’ rule lets us calculate the maximum a posteriori (MAP) estimate:

flip_ll(θ) = sum(logpdf(Bernoulli(θ), flips .== "H"))
flip_lprior(θ) = logpdf(Beta(5, 5), θ)
flip_lposterior(θ) = flip_ll(θ) + flip_lprior(θ)
θ_map = Optim.optimize-> -(flip_lposterior(θ)), 0, 1).minimizer
round(θ_map, digits=2)
0.71

Coin Flipping Posterior Distribution

Code
θ_range = 0:0.01:1
plot(θ_range, flip_lposterior.(θ_range), color=:black, label="Posterior", linewidth=3, tickfontsize=16, legendfontsize=16, guidefontsize=18, bottom_margin=5mm, left_margin=5mm)
vline!([θ_map], color=:red, label="MAP", linewidth=2)
vline!([θ_mle], color=:blue, label="MLE", linewidth=2)
xlabel!(L"$\theta$")
ylabel!("Posterior Density")
plot!(size=(1000, 450))
Figure 2: Posterior distribution for the coin-flipping example

Bayes and Parametric Uncertainty

Frequentist: Parametric uncertainty is purely the result of sampling variability

Bayesian: Parameters have probabilities based on consistency with data and priors.

Think: how “likely” is a set of parameters to have produced the data given the specified data generating process?

Bayesian Updating

  • The posterior is a “compromise” between the prior and the data.
  • The posterior mean is a weighted combination of the data and the prior mean.
  • The weights depend on the prior and the likelihood variances.
  • More data usually makes the posterior more confident.

Example: Local Sea Level Extremes

San Francisco Tide Gauge Data

Code
# read in data and get annual maxima
function load_data(fname)
    date_format = DateFormat("yyyy-mm-dd HH:MM:SS")
    # This uses the DataFramesMeta.jl package, which makes it easy to string together commands to load and process data
    df = @chain fname begin
        CSV.read(DataFrame; header=false)
        rename("Column1" => "year", "Column2" => "month", "Column3" => "day", "Column4" => "hour", "Column5" => "gauge")
        # need to reformat the decimal date in the data file
        @transform :datetime = DateTime.(:year, :month, :day, :hour)
        # replace -99999 with missing
        @transform :gauge = ifelse.(abs.(:gauge) .>= 9999, missing, :gauge)
        select(:datetime, :gauge)
    end
    return df
end

dat = load_data("data/surge/h551.csv")

# detrend the data to remove the effects of sea-level rise and seasonal dynamics
ma_length = 366
ma_offset = Int(floor(ma_length/2))
moving_average(series,n) = [mean(@view series[i-n:i+n]) for i in n+1:length(series)-n]
dat_ma = DataFrame(datetime=dat.datetime[ma_offset+1:end-ma_offset], residual=dat.gauge[ma_offset+1:end-ma_offset] .- moving_average(dat.gauge, ma_offset))

# group data by year and compute the annual maxima
dat_ma = dropmissing(dat_ma) # drop missing data
dat_annmax = combine(dat_ma -> dat_ma[argmax(dat_ma.residual), :], groupby(transform(dat_ma, :datetime => x->year.(x)), :datetime_function))
delete!(dat_annmax, nrow(dat_annmax)) # delete 2023; haven't seen much of that year yet
rename!(dat_annmax, :datetime_function => :Year)
select!(dat_annmax, [:Year, :residual])
dat_annmax.residual = dat_annmax.residual / 1000 # convert to m

# make plots
p1 = plot(
    dat_annmax.Year,
    dat_annmax.residual;
    xlabel="Year",
    ylabel="Annual Max Tide Level (m)",
    label=false,
    marker=:circle,
    markersize=5,
    tickfontsize=16,
    guidefontsize=18
)
p2 = histogram(
    dat_annmax.residual,
    normalize=:pdf,
    orientation=:horizontal,
    label=:false,
    xlabel="PDF",
    ylabel="",
    yticks=[],
    tickfontsize=16,
    guidefontsize=18
)

l = @layout [a{0.7w} b{0.3w}]
plot(p1, p2; layout=l, link=:y, ylims=(1, 1.7), bottom_margin=5mm, left_margin=5mm)
plot!(size=(1000, 450))
Figure 3: Annual maxima surge data from the San Francisco, CA tide gauge.

Proposed Probability Model

\[ \begin{align*} & y \sim LogNormal(\mu, \sigma) \tag{likelihood}\\ & \left. \begin{aligned} & \mu \sim Normal(0, 1) \\ & \sigma \sim HalfNormal(0, 5) \end{aligned} \right\} \tag{priors} \end{align*} \]

Want to find:

\[p(\mu, \sigma | y) \propto p(y | \mu, \sigma) p(\mu)p(\sigma)\]

Are Our Priors Reasonable?

Key idea: what do the priors imply for observable variables?

Let’s simulate data from the prior predictive distribution to see we get plausible outcomes.

\[y \sim p(\tilde{y}) = \int_{\Theta} p(\tilde{y} | \theta) p(\theta) d\theta\]

Prior Predictive Check

Code
# sample from priors
μ_sample = rand(Normal(0, 1), 1_000)
σ_sample = rand(truncated(Normal(0, 5), 0, +Inf), 1_000)

# define return periods and cmopute return levels for parameters
return_periods = 2:100
return_levels = zeros(1_000, length(return_periods))
for i in 1:1_000
    return_levels[i, :] = quantile.(LogNormal(μ_sample[i], σ_sample[i]), 1 .- (1 ./ return_periods))
end

plt_prior_1 = plot(; yscale=:log10, yticks=10 .^ collect(0:2:16), ylabel="Return Level (m)", xlabel="Return Period (yrs)",
    tickfontsize=16, legendfontsize=18, guidefontsize=18, bottom_margin=10mm, left_margin=10mm, legend=:topleft)
for idx in 1:1_000
    label = idx == 1 ? "Prior" : false
    plot!(plt_prior_1, return_periods, return_levels[idx, :]; color=:black, alpha=0.1, label=label)
end
plt_prior_1
Figure 4: Prior predictive check of return periods with revised model

Let’s Revise the Prior

\[ \begin{align*} & y \sim LogNormal(\mu, \sigma) \tag{likelihood}\\ & \left. \begin{aligned} & \mu \sim Normal(0, 0.5) \\ & \sigma \sim HalfNormal(0, 0.1) \end{aligned} \right\} \tag{priors} \end{align*} \]

Prior Predictive Check 2

Code
# sample from priors
μ_sample = rand(Normal(0, 0.5), 1_000)
σ_sample = rand(truncated(Normal(0, 0.1), 0, +Inf), 1_000)

return_periods = 2:100
return_levels = zeros(1_000, length(return_periods))
for i in 1:1_000
    return_levels[i, :] = quantile.(LogNormal(μ_sample[i], σ_sample[i]), 1 .- (1 ./ return_periods))
end

plt_prior_2 = plot(; ylabel="Return Level (m)", xlabel="Return Period (yrs)", tickfontsize=16, legendfontsize=18, guidefontsize=18, bottom_margin=10mm, left_margin=10mm)
for idx in 1:1_000
    label = idx == 1 ? "Prior" : false
    plot!(plt_prior_2, return_periods, return_levels[idx, :]; color=:black, alpha=0.1, label=label)
end
plt_prior_2
Figure 5: Prior predictive check of return periods with revised model

Compute Posterior

Code
ll(μ, σ) = sum(logpdf(LogNormal(μ, σ), dat_annmax.residual))
lprior1(μ, σ) = logpdf(Normal(0, 1), μ) + logpdf(truncated(Normal(0, 5), 0, Inf), σ)
lprior2(μ, σ) = logpdf(Normal(0, 0.5), μ) + logpdf(truncated(Normal(0, 0.1), 0, Inf), σ)
lposterior1(μ, σ) = ll(μ, σ) + lprior1(μ, σ)
lposterior2(μ, σ) = ll(μ, σ) + lprior2(μ, σ)

p_map1 = optimize(p -> -lposterior1(p[1], p[2]), [0.0, 0.0], [1.0, 1.0], [0.5, 0.5]).minimizer
p_map2 = optimize(p -> -lposterior2(p[1], p[2]), [0.0, 0.0], [1.0, 1.0], [0.5, 0.5]).minimizer

μ = 0.15:0.005:0.35
σ = 0.04:0.01:0.1
posterior1_vals = @. lposterior1', σ)
posterior2_vals = @. lposterior2', σ)

p_post1 = contour(μ, σ, posterior1_vals, 
    levels=100, 
    clabels=false, 
    cbar=false, lw=1, 
    fill=(true,cgrad(:grays,[0,0.1,1.0])),
    title = "Diffuse Prior"
)
scatter!(p_post1, [p_map1[1]], [p_map1[2]], label="MLE", markersize=10, marker=:star)
xlabel!(p_post1, L"$\mu$")
ylabel!(p_post1, L"$\sigma$")
plot!(p_post1, size=(600, 500))

p_post2 = contour(μ, σ, posterior2_vals, 
    levels=100, 
    clabels=false, 
    cbar=false, lw=1, 
    fill=(true,cgrad(:grays,[0,0.1,1.0])),
    title = "More Informed Priors"
)
scatter!(p_post2, [p_map2[1]], [p_map2[2]], label="MAP", markersize=10, marker=:star)
xlabel!(p_post2, L"$\mu$")
ylabel!(p_post2, L"$\sigma$")
plot!(p_post2, size=(600, 500))

display(p_post1)
display(p_post2)
(a) Posterior samples from surge model.
(b)
Figure 6
p_map1 = [0.25470601822948175, 0.05548961712923218]
p_map2 = [0.2546874075933288, 0.05542213686160764]

Assess MAP Fit

Code
p1 = histogram(
    dat_annmax.residual,
    normalize=:pdf,
    legend=:false,
    ylabel="PDF",
    xlabel="Annual Max Tide Level (m)",
    tickfontsize=16,
    guidefontsize=18,
    bottom_margin=5mm, left_margin=5mm
)
plot!(p1, LogNormal(p_map2[1], p_map2[2]),
    linewidth=3,
    color=:red)
xlims!(p1, (1, 1.7))
plot!(p1, size=(600, 450))

return_periods = 2:500
return_levels = quantile.(LogNormal(p_map2[1], p_map2[2]), 1 .- (1 ./ return_periods))

# function to calculate exceedance probability and plot positions based on data quantile
function exceedance_plot_pos(y)
    N = length(y)
    ys = sort(y; rev=false) # sorted values of y
    nxp = xp = [r / (N + 1) for r in 1:N] # exceedance probability
    xp = 1 .- nxp
    return xp, ys
end
xp, ys = exceedance_plot_pos(dat_annmax.residual)

p2 = plot(return_periods, return_levels, linewidth=3, color=:blue, label="Model Fit", tickfontsize=16, legendfontsize=18, guidefontsize=18, bottom_margin=5mm, left_margin=5mm, right_margin=10mm, legend=:bottomright)
scatter!(p2, 1 ./ xp, ys, label="Observations", color=:black, markersize=5)
xlabel!(p2, "Return Period (yrs)")
ylabel!(p2, "Return Level (m)")
xlims!(-1, 300)
plot!(p2, size=(600, 450))

display(p1)
display(p2)
(a) Checks for model fit.
(b)
Figure 7

What About The Posterior Distribution?

One of the points of Bayesian statistics is we get a distribution over parameters.

Sampling from this distribution is often more involved.

Exception: Conjugate Priors

When the mathematical forms of the likelihood and the prior(s) are conjugate, the posterior is a nice closed-form distribution.

Examples:

  • Normal \(p(y | \mu)\), Normal \(p(\mu)\) ⇒ Normal \(p(\mu | y)\)
  • Binomial \(p(y | \theta)\), Beta \(p(\theta)\), ⇒ Beta \(p(\theta | y)\)

Sampling using conjugate priors is called Gibbs sampling.

When Does The Prior Matter?

In general, priors matter more for:

  • Less data (likelihood less informative);
  • More complex models (more degrees of freedom).

Always justify and test your priors. Explicitly compare the prior to the posterior to see whether your inferences are driven by the prior or by the data (probability model).

Key Points and Upcoming Schedule

Key Points

  • Bayesian probability: parameters have probabilities conditional on data
  • Need to specify prior distribution (think generatively!).
  • Posterior distribution reflects compromise between prior and likelihood.
  • Maximum a posteriori gives “most probable” parameter values

Key Points: Priors

  • Use prior predictive simulations to refine priors.
  • Be transparent and principled about prior choices (sensitivity analyses?).
  • Don’t choose priors based on convenience.
  • Will talk more about general sampling later.

Upcoming Schedule

Next Classes

Next Week: Sampling! Specifically, Monte Carlo.

Assessments

Homework 2 due Friday (2/21).

Quiz: Due Monday (all on today’s lecture).

Project: Will discuss Monday, start thinking about possible topics.

References

References (Scroll for Full List)

Bayes, T. (1763). An Essay towards Solving a Problem in the Doctrine of Chance. Philosophical Transactions of the Royal Society of London, 53, 370–418.
Gelman, A., & Shalizi, C. R. (2013). Philosophy and the practice of Bayesian statistics. Br. J. Math. Stat. Psychol., 66, 8–38. https://doi.org/10.1111/j.2044-8317.2011.02037.x
Gelman, A., Simpson, D., & Betancourt, M. (2017). The prior can often only be understood in the context of the likelihood. Entropy, 19, 555. https://doi.org/10.3390/e19100555
Jaynes, E. T. (2003). Probability theory: the Logic of Science. (G. L. Bretthorst, Ed.). Cambridge, UK ; New York, NY: Cambridge University Press. Retrieved from https://market.android.com/details?id=book-tTN4HuUNXjgC
Laplace, P. S. (1774). Mémoire sur la Probabilité des Causes par les évènemens. In Mémoires de Mathematique et de Physique, Presentés à l’Académie Royale des Sciences, Par Divers Savans & Lus Dans ses Assemblées (pp. 621–656).